Abstract

Improvement guarantees for semi-supervised classifiers can currently only be given 
under restrictive conditions on the data. We propose a general way to perform semi- 
supervised parameter estimation for likelihood-based classifiers for which, on the full 
training set, the estimates are never worse than the supervised solution in terms of 
the log-likelihood. We argue, moreover, that we may expect these solutions to really 
improve upon the supervised classifier in particular cases. In a worked-out example 
for LDA, we take it one step further and essentially prove that its semi-supervised 
version is strictly better than its supervised counterpart. The two new concepts that 
form the core of our estimation principle are contrast and pessimism. The former 
refers to the fact that our objective function takes the supervised estimates into 
account, enabling the semi-supervised solution to explicitly control the potential
improvements over this estimate. The latter refers to the fact that our estimates are
conservative and therefore resilient to whatever form the true labeling of the
unlabeled data takes on. Experiments demonstrate the improvements in terms of both
the log-likelihood and the classification error rate on independent test sets. 

Keywords: maximum likelihood, semi-supervised learning, contrast, pessimism,
linear discriminant analysis.
1 Introduction


A century after its inception im parameter estimation through maximum likelihood 
(ML) is still one of the most widely used statistical estimation techniques. In a more 
rudimentary form, maximum likelihood can even be traced back as far as the 18th
century [3]. ML estimation is employed in fields as diverse as genealogy, imaging, genetics,
astrophysics, physiology, and quantum communication, as is illustrated by many recent
research works such as [5-17]. Moreover, new tools and techniques based on or related
to ML are still being developed within modern statistics and related fields. Some recent
examples are [18-23]. A satisfactory approach to ML-based estimation for semi-supervised
classifiers, however, has not been developed so far. 

In general, the aim of semi-supervised learning is to improve supervised classifiers by 
exploiting additional, typically easier to obtain, unlabeled data [24, 25]. Up to now,
however, the literature has reported mixed results when it comes to such improvements; it is
not always the case that semi-supervision leads to lower expected error rates or the like. On 
the contrary, severely deteriorated performances have been observed in empirical studies 
and theory shows that improvement guarantees can often only be provided under rather 
stringent conditions on the data we are dealing with [26-30].

In this work, we demonstrate when and how ML estimators for classification can be 
improved in the semi-supervised setting. We show that semi-supervised estimates can be 
constructed that are essentially closer to the estimates that would be obtained if the
labels of all unlabeled data were also available in the training phase. That is,
the semi-supervised estimates are closer to the estimates obtained with all labels available 
than the supervised estimates that rely on the same labeled instances as semi-supervision 
does, but that do not use the additional unlabeled data set. A crucial difference between 
the theory in this work and theories from, for instance, [26-30] is that the former can do
without strict assumptions on the data or the relation between the data and the classifier
considered. In fact, as we will see, Theorem 2 in Section 4 especially relies on assumptions
that are minimal and can be readily checked on the data at hand. Other results in semi- 
supervised learning resort to premises that generally cannot be conclusively tested for. 

In order to show the potential improvements semi-supervised classifiers can deliver, 
we introduce a novel, generally applicable estimation principle that extends likelihood 
estimation to the semi-supervised case in a consistent way. In particular, our method 
is contrastive, which refers to the fact that the objective function takes into account the
original supervised solution in an explicit way. This enables the semi-supervised solution 
to explicitly control the potential improvements over the supervised solution. In addition, 
our method is pessimistic, which refers to the fact that the unlabeled data is treated as
if it behaves in a worst kind of way, i.e., such that the semi-supervised estimates benefit 
the least from it. It makes the estimates conservative, but resilient to any possible state 
in which the unlabeled data can be encountered. We refer to this principle as maximum 
contrastive pessimistic likelihood estimation or MCPL estimation for short. 
1.1 Outline


In Section 3, the main theory is introduced, contrast and pessimism are further elucidated,
and our core, general estimation principle, MCPL, is presented. In that same section, we
also sketch the possibility of improved semi-supervised estimation by means of MCPL.
Sections 4 and 5 provide a worked-out illustration and a further specification of our theory.
The former section introduces the MCPL-based version of LDA, proves in what way the
semi-supervised LDA parameters are expected to really improve over the regular supervised
ones, and sketches the heuristic employed to tackle the related optimization problem. The
latter section, Section 5, provides extensive results on a range of data sets, comparing
regular supervised LDA and an earlier proposed semi-supervised approach to LDA [31]
with the novel semi-supervised LDA introduced here. Section 6 puts the results in a
somewhat broader perspective and raises some open issues. Finally, Section 7 concludes.
To begin with, however, we put our work in context, provide some preliminaries, introduce 
ML estimation and LDA, give an overview of the principal related works, and discuss 
related earlier findings. 


2 Background and Preliminaries 


The log-likelihood objective function for a $K$-class supervised classification problem takes
on the general form

$$L(\theta \mid X) = \sum_{i=1}^{N} \log p(x_i, y_i \mid \theta) = \sum_{k=1}^{K} \sum_{j=1}^{N_k} \log p(x_{kj}, k \mid \theta), \tag{1}$$


where class $k$ contains a total of $N_k$ samples, $N = \sum_{k=1}^{K} N_k$ is the total number of samples,
$X = \{(x_i, y_i)\}_{i=1}^{N}$ is the set of all labeled training pairs with $x_i \in \mathbb{R}^d$ the
$d$-dimensional feature vectors (see footnote 1), and $y_i \in C = \{1, \ldots, K\}$
are their corresponding labels. Denoted with $x_{kj}$ is the $j$th sample from class $k \in C$. Here,
every model parameter, specific to a particular class or not, is absorbed in $\theta \in \Theta$. The
set $\Theta$ contains all parameter settings possible, thus defining the full class of models under
consideration. Now, the supervised ML estimate, $\hat{\theta}_{\mathrm{sup}}$, maximizes the above criterion:

$$\hat{\theta}_{\mathrm{sup}} = \operatorname*{argmax}_{\theta \in \Theta} L(\theta \mid X). \tag{2}$$

Footnote 1: As is also common in many mathematical statistics and analysis textbooks, plain italic lowercase
letters may indicate vectors and not only scalars.
What follows is an overview of the main approaches to semi-supervised learning with a
particular focus on likelihood-based methods. Specific attention will furthermore be given 
to semi-supervised approaches to LDA. For broader and more extensive literature reviews, 
we refer to and [32].


2.1 Self-Learning and Expectation Maximization 

With the current work, we in essence revisit a problem in ML estimation that has already 
been considered as early as the late 1960s. In 1968, Hartley and Rao sketched a general 
way of exploiting unlabeled data

$$U = \{u_i\}_{i=1}^{M}$$

in likelihood estimation of model parameters for the analysis of variance [33]. The basic
idea is to consider all possible labelings that the unlabeled data could have and choose that 
labeling that achieves the largest log-likelihood. As such, this procedure still relies on ML 
estimation, but where the fully supervised model would merely optimize the log-likelihood 
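of the parameters of the model, here the unobserved labels

$$V = \{v_i\}_{i=1}^{M}$$

of the unlabeled data in $U$ are considered parameters over which the likelihood is maximized
as well:

$$\operatorname*{argmax}_{\theta \in \Theta} \left[ L(\theta \mid X) + \max_{V \in C^M} \sum_{i=1}^{M} \log p(u_i, v_i \mid \theta) \right]. \tag{3}$$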


Clearly, as the number of possible labelings grows exponentially with the number of 
unlabeled data points, even for fairly small sample sizes M this procedure is generally 
intractable. 

A learning strategy that is often referred to as self-learning or self-teaching approaches 
the problem in a similar though greedy way. In its simplest form, the classifier of
choice is trained on the available labeled data in an initial step. Using this trained classifier, 
all unlabeled data or part of it are assigned a label. Then, in a next step, this now labeled 
data is added to the training set and the classifier is retrained with this enlarged set. 
Given the newly trained classifier, one can relabel the initially unlabeled data and retrain 
the classifier again with these updated labels. This process is iterated until convergence, 
i.e., when the labeling of the initially unlabeled data remains unchanged. 
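
The iteration just described can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not part of the original text: features are one-dimensional, the "classifier of choice" is a nearest mean classifier, and the names `class_means`, `predict`, and `self_learn` are hypothetical.

```python
def class_means(xs, ys, classes):
    """Sample mean per class; assumes every class has at least one sample."""
    means = {}
    for k in classes:
        members = [x for x, y in zip(xs, ys) if y == k]
        means[k] = sum(members) / len(members)
    return means


def predict(means, x):
    """Assign x to the class with the nearest mean (ties go to the smallest label)."""
    return min(sorted(means), key=lambda k: abs(x - means[k]))


def self_learn(x_lab, y_lab, x_unl, max_iter=100):
    """Self-learning: train, label the unlabeled data, retrain, repeat until stable."""
    classes = sorted(set(y_lab))
    means = class_means(x_lab, y_lab, classes)   # initial supervised fit
    labels = None
    for _ in range(max_iter):
        new_labels = [predict(means, u) for u in x_unl]
        if new_labels == labels:                 # labeling unchanged: converged
            break
        labels = new_labels
        # retrain on the labeled data plus the currently imputed labels
        means = class_means(x_lab + x_unl, y_lab + labels, classes)
    return means, labels


means, labels = self_learn([0.0, 4.0], [0, 1], [0.5, 1.0, 3.0, 5.0])
```

Because the imputed labels are treated as if they were true labels, nothing in this loop prevents the retrained classifier from drifting away from a good supervised solution.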

McLachlan [S3], in 1975, was probably the first to apply this procedure and indeed 
suggested it as a computationally more tractable alternative to the one in [33]. Similar 
procedures have been reintroduced throughout the last couple of decades (see, for instance,
[35-37]). Outside of the literature on likelihood estimation, a procedure reminiscent of
McLachlan’s had already been proposed. In 1966, while dealing with an issue slightly 
different from semi-supervised learning, Nagy and Shelton proposed a general technique 
similar to self-learning [38]. One of the crucial differences is that the labeled data is only 
used to train the initial classifier. It does not play a role in any of the subsequent
self-learning iterations. This procedure, too, has been reconsidered many years after it was
initially suggested, e.g., in [135].

Possibly the best known semi-supervised likelihood-based approach treats the absence 
of labels as a classical missing-data problem and integrates out these nuisance parameters 
to come to a new, full model likelihood pTTTI ITT] 

$$L(\theta \mid X, U) = L(\theta \mid X) + \sum_{i=1}^{M} \log \sum_{k=1}^{K} p(u_i, k \mid \theta).$$



Its maximization over $\theta$ typically relies on the classical technique of expectation
maximization (EM) in which the estimates are not updated on the basis of hard labels, but
rather using posterior probabilities, which can equivalently be thought of as soft labels or 
assignments. In 1973, [32] and [33] were possibly the first to consider this specific problem 
explicitly, though m had already employed such formulation in its applied work in 1972. 
A more modern overview of EM approaches to partial classification can be found in [32] . 

At first glance, self-learning and EM may seem different ways of tackling the
semi-supervised classification problem, but there are clear parallels. Indeed, where EM provides
soft class assignments to all unlabeled data, self-learning just assigns every such instance 
in a hard way to one unique class in every iteration. In fact, [35] effectively shows that 
self-learners optimize the same objective as EM does. Similar observations have been made 

in m and m- 

The major problem with the aforementioned methods is that they can suffer from 
severely deteriorated performance with increasing numbers of unlabeled samples. This 
behavior, already extensively studied [5TM5I foU] . is often caused by model misspecification, 
i.e., the statistical class of models with parameters $\theta$ is not able to properly fit the actual
data distribution. We note that this is in contrast with the supervised setting, where most 
classifiers are capable of handling mismatched data assumptions rather well and adding 
more labeled data typically improves performance. The latter is in line with the behavior 
many misspecified likelihood models display [5T| . 

2.2 Density-Ratio Correction 

A rather different approach to semi-supervised estimation for likelihood-based models is 
offered in [52], in which the problem of semi-supervised learning is basically treated as one 
of learning under covariate shift [53]. Covariate shift is the setting in which the posterior
distribution of the labels given the data, $p(y \mid x)$, remains the same, while the marginal $p(x)$
might change when going from the training to the testing phase. Following 0. the main 
idea in [52] is that the marginal distribution over the feature space can be better estimated 
based on all data, both labeled and unlabeled. Subsequently, the density ratio between 
this estimate and the marginal estimate based on labeled data only can be exploited to 
weight the training data by means of their importance, as generally suggested in [53]. 
In their work, the authors prove that, asymptotically, this semi-supervised learning
procedure works better than its regular, supervised counterpart. Next to the fact that
results hold only asymptotically, the behavior of this semi-supervised learner seems to
depend strongly on the way the density ratio is determined. In the finite sample setting,
one may run into problems similar to those sketched in the previous subsection:
choosing an incorrect model for estimating the density ratio of the marginal feature
distributions could lead to deteriorated performance instead of performance improvements.
Experimental results in both [52] and [51] seem to reflect this. 

2.3 Intrinsically Constrained Estimation 

In recent years, the author proposed an essentially different take on semi-supervised
learning [55][55]. On a conceptual level, the idea is that the available unlabeled data indirectly
puts restrictions on the parameters possible, i.e., it basically allows us to look at a set that
is smaller than the initial set $\Theta$. A first operationalization of this idea has been studied
for the simple nearest mean classifier (NMC, [55]). It exploits constraints that are known
to hold for this classifier, defining relationships between the class-specific parameters and
certain statistics that are independent of the specific labeling. In particular, for the NMC
the following constraint can be exploited:


$$\hat{\mu} = \sum_{k=1}^{K} \hat{\pi}_k \hat{\mu}_k, \tag{4}$$

with $\hat{\mu}$ the estimated overall sample mean of the data, $\hat{\mu}_k$ the sample means of the $K$ classes,
and $\hat{\pi}_k = \frac{N_k}{N}$ the estimates of the class priors. In the supervised setting this constraint is
automatically fulfilled m- Its benefit only becomes apparent, therefore, with the arrival 
of unlabeled data that can be used to improve the label-independent estimate $\hat{\mu}$. Using
this more accurate estimate results in a violation of the constraint. Fixing it by properly
adjusting the $\hat{\mu}_k$s, these label-dependent estimates become more accurate as well.
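
As a small numerical illustration of the constraint in Equation (4), the sketch below checks that it holds on labeled data alone and that it is violated once the overall mean is re-estimated from all data. The one-dimensional toy data and the helper name `prior_weighted_mean` are assumptions made for this example; how the violation is subsequently repaired is not shown here.

```python
def prior_weighted_mean(xs, ys):
    """Right-hand side of Equation (4): sum over classes of prior times class mean."""
    n = len(xs)
    total = 0.0
    for k in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == k]
        total += (len(members) / n) * (sum(members) / len(members))
    return total


# hypothetical toy data: four labeled and four unlabeled scalar features
x_lab = [0.0, 1.0, 4.0, 7.0]
y_lab = [0, 0, 1, 1]
x_unl = [2.0, 3.0, 9.0, 10.0]

mu_sup = sum(x_lab) / len(x_lab)                       # overall mean, labeled data only
rhs = prior_weighted_mean(x_lab, y_lab)                # equals mu_sup: constraint holds
mu_all = sum(x_lab + x_unl) / (len(x_lab) + len(x_unl))  # label-independent estimate on all data
violation = mu_all - rhs                               # nonzero: constraint is now violated
```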

Supervised LDA can be improved in a similar way. The same constraint in Equation
(4) holds, but for LDA additional ones involving the class-conditional covariance matrix
apply. Notably, we have that the covariance matrix of all the data, the total covariance $\Sigma_T$,
equals the sum of the covariance between the class means, the between-class covariance $\Sigma_B$,
and the class-conditional covariance matrix $\Sigma$ (which is also referred to as the within-class
covariance) E3- ... 

$$\Sigma_T = \Sigma_B + \Sigma. \tag{5}$$

These additional constraints further restrict the possible semi-supervised solutions,
allowing for more significant improvements over the regular supervised classifier [Tlf5B].
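
The decomposition in Equation (5) can be checked numerically. The sketch below does so for the scalar case, where the covariance matrices reduce to variances; it assumes ML-type estimates (division by $N$), and the function name `variance_decomposition` and the toy data are hypothetical.

```python
def variance_decomposition(xs, ys):
    """Return (total, between-class, within-class) variance, all normalized by N."""
    n = len(xs)
    mu = sum(xs) / n
    total = sum((x - mu) ** 2 for x in xs) / n
    between = 0.0
    within = 0.0
    for k in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == k]
        mu_k = sum(members) / len(members)
        between += (len(members) / n) * (mu_k - mu) ** 2   # prior-weighted spread of class means
        within += sum((x - mu_k) ** 2 for x in members) / n  # pooled class-conditional variance
    return total, between, within


total, between, within = variance_decomposition([0.0, 1.0, 4.0, 7.0], [0, 0, 1, 1])
```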

The aforementioned works enforce the constraints imposed in a rather ad hoc way. A
somewhat more principled constrained likelihood approach is suggested in [58, 59].
Generally, given any constraint $h(\theta) = 0$ that the parameters of the semi-supervised classifier
should comply with, the idea is to maximize the original likelihood from Equation (1), as
in Equation (2), but subject to the constraint, i.e., we solve

$$\operatorname*{argmax}_{\theta \in \Theta} L(\theta \mid X) \quad \text{subject to} \quad h(\theta) = 0.$$

Reference [59] shows, for instance, how to formulate the constrained NMC from [55] in
this way. A major shortcoming of this approach is that such constraints must have been 
identified in the first place. For this reason, its applicability to other classifiers is currently 
limited. 

A second and more recent instantiation of our general idea coined in [55] does allow
for broader applicability [60, 61]. The approach is to find those parameters that
maximize the likelihood on the labeled data set $X$, but only allows solutions that can be
achieved with a data set that includes labeled versions of the initially unlabeled instances
as well. In terms of a likelihood formulation, what it suggests to solve is the following:

$$\operatorname*{argmax}_{\theta \in T} L(\theta \mid X) \quad \text{with} \quad T = \left\{ \operatorname*{argmax}_{t \in \Theta} L(t \mid X_V) \;\middle|\; V \in C^M \right\}.$$

The first important ingredient is the set $X_V$, which is the labeled data set $X$ augmented
with the unlabeled data $U$ combined with the labels in $V$. So

$$X_V = X \cup \{(u_i, v_i)\}_{i=1}^{M}$$

is a fully labeled data set for all $V \in C^M$. The second important ingredient is the set $T$,
which typically is a proper subset of the original parameter set $\Theta$. This set $T$ contains all
possible classifier parameters $t$ that are obtained by training classifiers on all of the possible
fully labeled data sets $X_V$. As we need to consider all possible labelings for the unlabeled
data, this brings us back to Hartley and Rao’s intractable method [33]. In [60] and [61], this
problem is overcome by introducing the possibility of fractional or soft labels, resulting in
a well-behaved quadratic programming problem for the case of the least squares classifier.

Putting our earlier work further in the appropriate context, we should finally
mention [62] and [63], where likelihood-based semi-supervised learning guided by particular
constraints is considered as well. The crucial difference is that the constraints proposed in 
these works are typically derived from domain knowledge and very task specific. If these 
a priori constraints are correct, a learner can obviously benefit from them, even in the 
supervised case. If they are incorrect they may lead to severely deteriorated performance. 
So where these constraints are classifier-extrinsically motivated, any other method in this
subsection relies on intrinsically motivated constraints, which are fixed as soon as the data 
is available and the choice of classifier is made. 

2.4 Supervised and Semi-Supervised LDA 

As our worked-out example in Sections 4 and 5 concerns LDA, this subsection turns to its
associated likelihood and the specific semi-supervised solutions that have been proposed 
for this classical technique. 
Compared to Equation (1), the log-likelihood objective function for $K$-class LDA takes
on a more specific form. We can write

$$\begin{aligned}
L_{\mathrm{LDA}}(\theta \mid X) &= \sum_{i=1}^{N} \log p(x_i, y_i \mid \pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma) \\
&= \sum_{k=1}^{K} \sum_{j=1}^{N_k} \log p(x_{kj}, k \mid \pi_k, \mu_k, \Sigma) \\
&= \sum_{k=1}^{K} \sum_{j=1}^{N_k} \log \pi_k \, g(x_{kj} \mid \mu_k, \Sigma),
\end{aligned} \tag{7}$$

where $\theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma)$, the $\pi_k$ are the class priors, the $\mu_k$ are the class means, and $\Sigma$ is
the class-conditional covariance matrix. The $g$ on the last line denotes the normal (or
Gaussian) probability density function. Of course, to find the supervised solution, we
solve the maximization already noted in Equation (2), which leads to the well-known ML
estimates of the parameters of regular supervised LDA.
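
To make Equation (7) concrete, the following sketch evaluates it for univariate LDA, where $\Sigma$ reduces to a single shared variance. The closed-form ML estimates used here are the standard ones (class frequencies, class means, pooled variance normalized by $N$); the function names `lda_fit` and `lda_loglik` and the toy data are assumptions for illustration.

```python
import math


def lda_fit(xs, ys):
    """Supervised ML estimates for univariate LDA: (priors, means, shared variance)."""
    n = len(xs)
    priors, means = {}, {}
    scatter = 0.0
    for k in sorted(set(ys)):
        members = [x for x, y in zip(xs, ys) if y == k]
        priors[k] = len(members) / n
        means[k] = sum(members) / len(members)
        scatter += sum((x - means[k]) ** 2 for x in members)
    return priors, means, scatter / n


def lda_loglik(theta, xs, ys):
    """Equation (7) in the scalar case: sum of log(pi_k * g(x | mu_k, var))."""
    priors, means, var = theta
    return sum(
        math.log(priors[y]) - 0.5 * math.log(2 * math.pi * var)
        - (x - means[y]) ** 2 / (2 * var)
        for x, y in zip(xs, ys)
    )


xs, ys = [0.0, 1.0, 4.0, 7.0], [0, 0, 1, 1]
theta_sup = lda_fit(xs, ys)
loglik_sup = lda_loglik(theta_sup, xs, ys)
```

Any other parameter setting, for instance one with a shifted class mean, yields a lower value of `lda_loglik` on the same data.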

Semi-supervised LDA has been considered both in theoretical and methodological work. 
The main example in Hartley and Rao’s work [33] treats univariate LDA in the semi-
supervised setting. Also McLachlan 0 focusses on LDA. Following these contributions, 
other early studies of the use of unlabeled data in LDA can be found in [3011411165] and 
Self-learned and intrinsically constrained versions of LDA have been compared in 
and [31].

Let us finally remark that various contributions from a large number of disciplines still 
employ classical, supervised LDA as their decision rule of choice. A handful of recent
examples from the applied and natural sciences can be found in some of the earlier-mentioned
references: [SHH] • Semi-supervised versions of LDA, however, have not been widely applied. 
The general shortcoming mentioned in Subsection 2.1, the fact that self-learned and EM
versions can give sharply inferior performance, probably contributes to this. 


3 Contrastive Pessimistic ML 

For none of the aforementioned semi-supervised learning schemes and classifiers are there
currently any generally applicable guarantees when it comes to performance improvements,
unless one makes strong assumptions about the data. The learning strategy that we devise
in this section does allow for such a guarantee on the training set in a strict way. This we
will show in Section 4. The main, general theory is provided in the current section.
Consider the fully labeled data set

$$X_{V^*} = X \cup \{(u_i, v_i^*)\}_{i=1}^{M}.$$

It is similar to $X_V$ considered in Subsection 2.3, but we now assume that $V^*$ contains the
true labels $v_i^*$ belonging to the feature vectors in $U$. Define

$$\hat{\theta}_{\mathrm{opt}} = \operatorname*{argmax}_{\theta \in \Theta} L(\theta \mid X_{V^*}),$$

which gives the classifier’s parameter estimates on the full training set in which also the
unlabeled data is labeled. With respect to this enlarged training set $X_{V^*}$, the estimate $\hat{\theta}_{\mathrm{opt}}$
is optimal by construction and cannot be improved upon. As the supervised parameters
in $\hat{\theta}_{\mathrm{sup}}$ are estimated merely on a subset $X$ of $X_{V^*}$, we have

$$L(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*}) \leq L(\hat{\theta}_{\mathrm{opt}} \mid X_{V^*}).$$

In the semi-supervised setting, both $X$ and $U$ are at our disposal, but $V^*$ has not
been observed. We have more information than in the supervised setting, but less than in
the optimal, fully labeled case. The principal result obtained in this section is that, for
likelihood-based classifiers, semi-supervised parameter estimates $\hat{\theta}_{\mathrm{semi}}$ obtained by means of
MCPL are essentially in between the corresponding supervised and the optimal estimates:

$$L(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*}) \leq L(\hat{\theta}_{\mathrm{semi}} \mid X_{V^*}) \leq L(\hat{\theta}_{\mathrm{opt}} \mid X_{V^*}).$$

In itself, this result might not seem all too helpful as we can easily come up with a
semi-supervised parameter estimate for which these inequalities are trivially fulfilled: take $\hat{\theta}_{\mathrm{semi}}$
to equal $\hat{\theta}_{\mathrm{sup}}$. However, we first want to clarify that the inequality holds generally for MCPL
before we proceed and make the claim that strict improvements by means of MCPL over
regular supervised estimation can be expected. That is, we argue, at least for particular
classifiers, that

$$L(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*}) < L(\hat{\theta}_{\mathrm{semi}} \mid X_{V^*}),$$

i.e., the log-likelihood on the fully labeled set $X_{V^*}$ obtained by the semi-supervised
estimates is strictly larger than that obtained under supervision. For LDA, this is proven in
Section 4.


3.1 Contrast and Pessimism 


To be able to construct a semi-supervised learner that improves upon its supervised
counterpart, we take the supervised estimate into account explicitly and consider the difference
in loss incurred by $\hat{\theta}_{\mathrm{semi}}$ and $\hat{\theta}_{\mathrm{sup}}$.

Before doing so, however, we first introduce some notation. We define $q_{ki}$ to be the
hypothetical posterior $P(k \mid u_i)$ of observing a particular label $k$ given the feature vector
$u_i$. We may interpret the $q_{ki}$ as soft labels for every $u_i$ and will also refer to them as such.
This respects the fact that classes may be overlapping and not every $u_i$ can be assigned
unambiguously to a single class. By definition, $\sum_{k \in C} q_{ki} = 1$. More precisely, we can state
that the $K$-dimensional vector $q_{\cdot i}$ is an element of the $(K-1)$-simplex $\Delta_{K-1}$ in $\mathbb{R}^K$:

$$q_{\cdot i} \in \Delta_{K-1} = \left\{ (\rho_1, \ldots, \rho_K)^{T} \in \mathbb{R}^K \;\middle|\; \sum_{k=1}^{K} \rho_k = 1,\ \rho_k \geq 0 \right\}.$$

Provided that these posteriors are given, we can express the log-likelihood on the complete
data set for any $\theta$ as

$$L(\theta \mid X, U, q) = L(\theta \mid X) + \sum_{i=1}^{M} \sum_{k=1}^{K} q_{ki} \log p(u_i, k \mid \theta), \tag{8}$$


in which the dependence on the $q_{ki}$s is explicitly indicated also on the left-hand side by
means of the variable $q$. Note that use of these soft labels in $q$ allows more flexibility than
just using a set of hard labels $V \in C^M$, such as was for instance done in Equation (3) and
in the definition of the set $T$ in Subsection 2.3.

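For a given $q$, the relative improvement of any semi-supervised estimate $\theta$ over the
supervised solution can now be expressed as follows:

$$\mathrm{CL}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U, q) = L(\theta \mid X, U, q) - L(\hat{\theta}_{\mathrm{sup}} \mid X, U, q). \tag{9}$$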

This contrasts the semi-supervised solution with the regular supervised solution obtained
on the data set $X$, enabling us to explicitly check to what extent semi-supervised
improvements are possible in terms of log-likelihood. As we are dealing with a semi-supervised
problem, $q$ is unknown and we cannot use Equation (9) directly for optimization. The
choice we make now is the most pessimistic one: we are going to assume that the true
(soft) labeling is most adverse against any semi-supervised approach and consider the $q$
that minimizes the gain in likelihood. That is, our objective function becomes

$$\mathrm{CPL}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U) = \min_{q \in \Delta_{K-1}^{M}} \mathrm{CL}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U, q), \tag{10}$$

where $\Delta_{K-1}^{M} = \prod_{i=1}^{M} \Delta_{K-1}$ is the Cartesian product of $M$ simplices.
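
For a fixed $\theta$, the inner minimization in Equation (10) is easy to carry out: by Equations (8) and (9) the contrast is linear in $q$ and separates over the unlabeled points, so a minimum over the product of simplices is attained at a vertex, i.e., at a hard labeling that picks, for every $u_i$, the class with the smallest gain. The sketch below illustrates this for univariate LDA; the parameter layout `(priors, means, variance)`, the function names, and the toy numbers are assumptions made for this example and do not reproduce the heuristic used for the full maximinimization.

```python
import math


def log_joint(theta, x, k):
    """log p(x, k | theta) for univariate LDA; theta = (priors, means, variance)."""
    priors, means, var = theta
    return (math.log(priors[k]) - 0.5 * math.log(2 * math.pi * var)
            - (x - means[k]) ** 2 / (2 * var))


def cpl(theta, theta_sup, x_lab, y_lab, x_unl):
    """Equation (10) for fixed theta: returns (CPL value, worst-case hard labels)."""
    classes = sorted(theta_sup[0])
    value = sum(log_joint(theta, x, y) - log_joint(theta_sup, x, y)
                for x, y in zip(x_lab, y_lab))
    worst_labels = []
    for u in x_unl:
        gain = {k: log_joint(theta, u, k) - log_joint(theta_sup, u, k) for k in classes}
        k_worst = min(classes, key=lambda k: gain[k])   # most adverse label for this point
        worst_labels.append(k_worst)
        value += gain[k_worst]
    return value, worst_labels


def cl_hard(theta, theta_sup, x_lab, y_lab, x_unl, v):
    """Equation (9) evaluated at a given hard labeling v of the unlabeled data."""
    xs, ys = x_lab + x_unl, y_lab + v
    return sum(log_joint(theta, x, y) - log_joint(theta_sup, x, y) for x, y in zip(xs, ys))


x_lab, y_lab = [0.0, 1.0, 4.0, 7.0], [0, 0, 1, 1]
x_unl = [2.0, 3.0, 9.0, 10.0]
theta_sup = ({0: 0.5, 1: 0.5}, {0: 0.5, 1: 5.5}, 1.25)   # supervised ML fit of x_lab
theta_alt = ({0: 0.5, 1: 0.5}, {0: 1.0, 1: 6.5}, 2.0)    # some candidate parameters

cpl_sup, _ = cpl(theta_sup, theta_sup, x_lab, y_lab, x_unl)   # contrast with itself is zero
cpl_alt, worst = cpl(theta_alt, theta_sup, x_lab, y_lab, x_unl)
```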


3.2 MCPL Estimation 

We are now ready to define MCPL estimation, which extends general likelihood estimation 
for supervised learners to the general semi-supervised case. 

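Definition 1 (MCPL). Let $\hat{\theta}_{\mathrm{sup}}$ be the supervised ML estimate maximizing $L(\theta \mid X)$ and let
$U$ be a set of unlabeled data. A maximum contrastive pessimistic likelihood estimate, $\hat{\theta}_{\mathrm{semi}}$,
is an estimate that maximizes the criterion $\mathrm{CPL}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U)$ in Equation (10), i.e.,

$$\hat{\theta}_{\mathrm{semi}} = \operatorname*{argmax}_{\theta \in \Theta} \mathrm{CPL}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U). \tag{11}$$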


Maximizing the objective function CPL for $\theta$ leads to a rather conservative estimate,
because of the pessimistic choice of $q$. But we need this choice, in combination with the
contrastive nature of the objective function, to be able to guarantee that the following 
holds. 
Lemma 1.

$$L(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*}) \leq L(\hat{\theta}_{\mathrm{semi}} \mid X_{V^*}) \leq L(\hat{\theta}_{\mathrm{opt}} \mid X_{V^*}). \tag{12}$$

To see that the lemma indeed holds, consider Equation (11). Because we can take
$\theta = \hat{\theta}_{\mathrm{sup}}$, for which the contrast equals 0 for every $q$, the value 0 is always attained
in this maximization. As a consequence, the maximum will never be smaller than 0:

$$\max_{\theta \in \Theta} \mathrm{CPL}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U) \geq 0.$$

Looking at Equation (9), this means that the difference between the semi-supervised and
the supervised log-likelihood is at least 0, but as this holds even for the worst choice
of $q$, it must also hold for the true hard labeling considered in $X_{V^*}$. From this, the first
inequality in Equation (12) follows. The second inequality holds because $\hat{\theta}_{\mathrm{opt}}$ maximizes
$L(\theta \mid X_{V^*})$ by construction, which shows the lemma to hold.
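
The argument for the first inequality can be written out as a single chain. The symbol $q^*$ is introduced here only for this illustration and denotes the hard labeling that corresponds to $V^*$, i.e., $q^*_{ki} = 1$ if $k = v_i^*$ and $q^*_{ki} = 0$ otherwise, so that $L(\theta \mid X, U, q^*) = L(\theta \mid X_{V^*})$ by Equation (8).

```latex
\begin{aligned}
L(\hat{\theta}_{\mathrm{semi}} \mid X_{V^*}) - L(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*})
  &= \mathrm{CL}(\hat{\theta}_{\mathrm{semi}}, \hat{\theta}_{\mathrm{sup}} \mid X, U, q^*) \\
  &\geq \min_{q \in \Delta_{K-1}^{M}} \mathrm{CL}(\hat{\theta}_{\mathrm{semi}}, \hat{\theta}_{\mathrm{sup}} \mid X, U, q)
   = \mathrm{CPL}(\hat{\theta}_{\mathrm{semi}}, \hat{\theta}_{\mathrm{sup}} \mid X, U) \\
  &= \max_{\theta \in \Theta} \mathrm{CPL}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U)
   \geq \mathrm{CPL}(\hat{\theta}_{\mathrm{sup}}, \hat{\theta}_{\mathrm{sup}} \mid X, U) = 0 .
\end{aligned}
```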

3.3 Prospects of Improved Estimates 

If we can show for a classifier that we can expect the inequalities in Lemma 1 to be strict,
then we can conclude that the semi-supervised parameter estimates are essentially better 
than those obtained under supervision. When can we expect this to happen? There are 
at least two different ways. 

Firstly, a semi-supervised classifier can be better if the true underlying soft labeling is 
less adversarial than the worst-case that is considered in MCPL estimation. Even though 
we cannot give any general quantitative statement on how often this happens, we can 
imagine that this is quite likely. Secondly, we can expect improvements in case the set of 
feature vectors of the labeled instances, X , is an ill representation of the complete set of 
labeled and unlabeled data, X and U. It is clear that nothing can be gained in the other 
extreme, where the feature vectors in U are just exact copies of those in X. In that case, 
MCPL estimation would just recover the supervised estimate. In the next section, we use 
such ill-representation argument to show that semi-supervised LDA typically outperforms 
its supervised counterpart. 

4 MCPL Version of LDA 

Combining MCPL estimation as defined in Subsection 3.2 with the log-likelihood
formulation of regular supervised LDA from Equation (7) leads to our proposal of a proper
semi-supervised version of LDA. Following the previous section, we have

$$L_{\mathrm{LDA}}(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*}) \leq L_{\mathrm{LDA}}(\hat{\theta}_{\mathrm{semi}} \mid X_{V^*}).$$

Here and in what follows, the subscripted LDA makes explicit that we are specifically
considering this classifier. Subsection 4.3 briefly presents the heuristic we used to carry
out the necessary maximinimization to actually obtain $\hat{\theta}_{\mathrm{semi}}$. But first, in the next two
subsections, we demonstrate that we can expect improved semi-supervised estimation.
4.1 Preliminaries


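As the set of normal densities $g(x \mid \mu_k, \Sigma)$ makes up an exponential family, it can be
reparameterized into a so-called canonical parametrization such that it is concave in its
parameters [67, 68]. Denote this reparametrization by $\vartheta$. For fixed $q$, $L_{\mathrm{LDA}}(\vartheta \mid X, U, q)$ is also
concave. Now, by definition of the MCPL estimate

$$\begin{aligned}
\max_{\vartheta \in \Theta} \mathrm{CPL}_{\mathrm{LDA}}(\vartheta, \hat{\vartheta}_{\mathrm{sup}} \mid X, U)
&= \max_{\vartheta \in \Theta} \min_{q \in \Delta_{K-1}^{M}} \mathrm{CL}_{\mathrm{LDA}}(\vartheta, \hat{\vartheta}_{\mathrm{sup}} \mid X, U, q) \\
&= \max_{\vartheta \in \Theta} \min_{q \in \Delta_{K-1}^{M}} \left[ L_{\mathrm{LDA}}(\vartheta \mid X, U, q) - L_{\mathrm{LDA}}(\hat{\vartheta}_{\mathrm{sup}} \mid X, U, q) \right].
\end{aligned}$$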

From this, it is not difficult to see that for fixed $q$, $\mathrm{CL}_{\mathrm{LDA}}$ is concave in $\vartheta$ and for fixed $\vartheta$,
$\mathrm{CL}_{\mathrm{LDA}}$ is linear in $q$. So $\mathrm{CL}_{\mathrm{LDA}}$ is in fact concave-convex on $\Theta \times \Delta_{K-1}^{M}$. In addition,
$\Delta_{K-1}^{M}$ is compact and so we can invoke the important minimax corollary by Sion [69] that
allows us to interchange the maximization and minimization, which in turn means that the
solution to the above maximinimization is a saddle point [70]. Moreover, the estimate $\hat{\vartheta}_{\mathrm{semi}}$
is unique if $\mathrm{CL}_{\mathrm{LDA}}$ is strictly concave in $\vartheta$ [70]. This is ensured if $\Sigma$ is positive definite.
From Equation (14) in Subsection 4.2 it follows that this holds, for instance, if $\hat{\Sigma}_{\mathrm{sup}}$ is
positive definite. Equivalently, we will assume the supervised estimation problem to be
well-posed.

For normal distributions, both the standard parametrization and the canonical
parametrization are complete parameterizations. We have [67]: $\vartheta = \vartheta(\theta) = (\Sigma^{-1}\mu, \operatorname{triu}(-\Sigma^{-1}))$, where
$\operatorname{triu}(A)$ returns the upper triangular part of the square matrix $A$. As we consider well-posed
estimation problems, $\Sigma$ is invertible and so the mapping between $\theta$ and $\vartheta$ is a bijection
(cf. [70]). So coming back from the canonical parametrization $\vartheta$ to our original $\theta$, we see
that the maximinimization also leads to a unique solution for $\hat{\theta}_{\mathrm{semi}}$. This will be important
in what follows.


4.2 Semi-Supervised Improvements 

We consider $\mathrm{CL}_{\mathrm{LDA}}(\theta, \hat{\theta}_{\mathrm{sup}} \mid X, U, q)$, which is Equation (9) with the particular choice of the
likelihood from Equation (7). Leaving $q$ fixed, we saw that there is a unique maximizer
for $\mathrm{CL}_{\mathrm{LDA}}$. Fixing $q$, the supervised part of the contrastive likelihood does not play an
essential role in the objective function. It merely provides an offset, and the maximizer of
$\mathrm{CL}_{\mathrm{LDA}}$ is equal to the maximizer of $L_{\mathrm{LDA}}(\theta \mid X, U, q)$. Now, the latter is a weighted version
of standard LDA (the weights are provided by $q$) and it is not difficult to show that, for
every class $k \in C$, the optimal ML parameter estimates are given by




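$$\hat{\pi}_k = \frac{N_k + \sum_{i=1}^{M} q_{ki}}{N + M}, \qquad
\hat{\mu}_k = \frac{\sum_{j=1}^{N_k} x_{kj} + \sum_{i=1}^{M} q_{ki} u_i}{N_k + \sum_{i=1}^{M} q_{ki}}, \tag{13}$$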
while the estimate of the average class-conditional covariance matrix becomes

$$\hat{\Sigma} = \frac{1}{N + M} \sum_{k=1}^{K} \left[ \sum_{j=1}^{N_k} (x_{kj} - \hat{\mu}_k)(x_{kj} - \hat{\mu}_k)^{T} + \sum_{i=1}^{M} q_{ki} (u_i - \hat{\mu}_k)(u_i - \hat{\mu}_k)^{T} \right]. \tag{14}$$

Note that the total data mean equals

$$\hat{\mu}^{\mathrm{semi}} = \frac{1}{N + M} \left[ \sum_{i=1}^{N} x_i + \sum_{i=1}^{M} u_i \right], \tag{15}$$



which is independent of the soft labels q. We now additionally note that also for weighted
LDA, for any choice of q, the constraint in Equation (j4]) holds. The MCPL solution $\hat{\theta}_{\mathrm{semi}}$
will have corresponding pessimistic soft labels $q_{\mathrm{semi}}$ and therefore satisfies the constraint
as well: $\hat{\mu}_{\mathrm{semi}} = \sum_{k=1}^{K} \hat{\pi}_k^{\mathrm{semi}} \hat{\mu}_k^{\mathrm{semi}}$.
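To make the closed-form estimates of Equations (13) and (14) and the mean constraint concrete, the following is a minimal sketch, assuming NumPy, integer class labels 0..K-1, and soft labels stored as an M-by-K matrix whose rows sum to one. The function name `weighted_lda_estimates` and the synthetic data are assumptions of this sketch, not part of the original work.

```python
import numpy as np

def weighted_lda_estimates(X, y, U, q, K):
    """Weighted LDA estimates in the spirit of Equations (13) and (14).

    X: (N, d) labeled features, y: (N,) integer labels in 0..K-1,
    U: (M, d) unlabeled features, q: (M, K) soft labels, rows summing to one.
    """
    N, d = X.shape
    M = U.shape[0]
    priors = np.empty(K)
    means = np.empty((K, d))
    cov = np.zeros((d, d))
    for k in range(K):
        Xk = X[y == k]
        wk = q[:, k]
        mass = Xk.shape[0] + wk.sum()              # N_k + sum_i q_ki
        priors[k] = mass / (N + M)
        means[k] = (Xk.sum(axis=0) + wk @ U) / mass
        Dx = Xk - means[k]
        Du = U - means[k]
        cov += Dx.T @ Dx + (Du * wk[:, None]).T @ Du
    return priors, means, cov / (N + M)

# Synthetic illustration (hypothetical data): the prior-weighted class means
# reproduce the total data mean for an arbitrary soft labeling q.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.arange(12) % 2
U = rng.normal(size=(30, 3))
q = rng.dirichlet(np.ones(2), size=30)
priors, means, cov = weighted_lda_estimates(X, y, U, q, K=2)
total_mean = (X.sum(axis=0) + U.sum(axis=0)) / (len(X) + len(U))
print(np.allclose(priors @ means, total_mean))
```

Because the rows of q sum to one, the soft labels cancel in the prior-weighted sum of class means, which is why the total mean does not depend on q.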

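Now, if semi-supervised learning does not improve over the supervised estimate, $\hat{\theta}_{\mathrm{semi}}$
should equal the initial supervised solution $\hat{\theta}_{\mathrm{sup}}$, because the estimate is unique (see
Subsection 4.1). This, in turn, implies that we also have $\hat{\mu}_{\mathrm{semi}} = \sum_{k=1}^{K} \hat{\pi}_k^{\mathrm{sup}} \hat{\mu}_k^{\mathrm{sup}}$. But as the
supervised solution is trained on X only, it should simultaneously fulfil the constraint in
Equation (J3J) with the total data mean equal to

$$\hat{\mu}_{\mathrm{sup}} = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{16}$$

i.e., the sample average of X. We therefore have:

$$\hat{\mu}_{\mathrm{sup}} = \sum_{k=1}^{K} \hat{\pi}_k^{\mathrm{sup}} \hat{\mu}_k^{\mathrm{sup}} = \hat{\mu}_{\mathrm{semi}}.$$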


If the feature vectors of our classification problem come from a continuous distribution 
then, unless U is empty, the probability that $\hat{\mu}_{\mathrm{sup}}$ equals $\hat{\mu}_{\mathrm{semi}}$ is zero. This, in turn,
implies that we can expect $\hat{\theta}_{\mathrm{semi}}$ to be different from $\hat{\theta}_{\mathrm{sup}}$ and, therefore, improve upon it.
With this, we have proven our first main result concerning semi-supervised LDA. 

Theorem 1. If the supervised estimation problem is well-posed, M > 1, and if the feature 
vectors are continuously distributed, the strict inequality 

$$L_{\mathrm{LDA}}(\hat{\theta}_{\mathrm{semi}} \mid X_{V^*}) > L_{\mathrm{LDA}}(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*})$$


holds almost surely. 
We should note that if the feature distribution is discrete, the inequality holds with a
probability smaller than one. Nonetheless, when either the number of discrete elements of
the distribution, the number N of labeled points, or the number M of unlabeled feature
vectors is large, the probability that the inequality is strict typically gets close to one. We
dare to conjecture that Theorem 1 will be accurate for many practical purposes, even in
the discrete case.

What we can say in the discrete case is that the probability that $\hat{\mu}_{\mathrm{sup}}$ does not equal
$\hat{\mu}_{\mathrm{semi}}$ is nonzero and, therefore, we at least have strict improvement in expectation.

Theorem 2. If the supervised estimation problem is well-posed and M > 1, we have 

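$$E[L_{\mathrm{LDA}}(\hat{\theta}_{\mathrm{opt}} \mid X_{V^*})] > E[L_{\mathrm{LDA}}(\hat{\theta}_{\mathrm{semi}} \mid X_{V^*})] > E[L_{\mathrm{LDA}}(\hat{\theta}_{\mathrm{sup}} \mid X_{V^*})],$$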

where the expectation is taken over U. 

Hence, LDA parameter estimation by means of MCPL is, on average, always better
than classical supervised log-likelihood estimation.

4.3 Solving the Maximinimization 

As was already discussed in Subsection 4.1, the objective function, as provided by Equation
m, is linear in q and strictly concave in $\theta$. As a result, we know that we are looking for
a saddle point solution with a unique optimizer for $\theta$. Moreover, we know there are no
other local saddle point solutions for this maximinimization problem m. The basis of our
heuristic to come to an MCPL estimate for the parameters of semi-supervised LDA is formed by
the following two steps, between which the optimization alternates.

1. Given a soft labeling q, the optimal, maximizing LDA parameters $\theta$ are estimated by
means of Equations (13) and (14).

2. Given LDA parameters $\theta$, the gradient $\nabla$ for q is calculated, and q is changed to
$q - \alpha\nabla$, with $\alpha > 0$ the step size. The following should be noted:

(a) $q - \alpha\nabla$ is not guaranteed to be in $\Delta_{K-1}^{M}$, so we project back into this set in
every iteration ra;

(b) the objective function is linear in q, so the gradient $\nabla$ is easily obtained:

$$\nabla_{ki} = \log \pi_k\, g(u_i \mid \mu_k, \Sigma) - \log \hat{\pi}_k^{\mathrm{sup}}\, g(u_i \mid \hat{\mu}_k^{\mathrm{sup}}, \hat{\Sigma}^{\mathrm{sup}});$$

(c) we want to minimize for q, so we change its value in the direction opposite of
the gradient, i.e., with $-\alpha$.
In our experiments in Section 5, the step size $\alpha$ is decreased as one over the number of
iterations. Furthermore, we limit the maximum number of iterations to 1000. In addition,
if the maximin objective does not change more than $10^{-6}$ in one iteration, the optimization
is halted. With these settings, in our experiments, the maximum number of iterations is
seldom reached (in less than one in every thousand cases).
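The two alternating steps and the stopping rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function names are hypothetical, the projection is taken to be the Euclidean projection of every row of q onto the simplex (the text does not say which projection is used), q is initialized uniformly, and the step size is set to exactly 1/t at iteration t.

```python
import numpy as np

def project_rows_to_simplex(Q):
    """Euclidean projection of each row of Q onto the probability simplex (assumed choice)."""
    M, K = Q.shape
    S = -np.sort(-Q, axis=1)                       # rows sorted in descending order
    css = np.cumsum(S, axis=1) - 1.0
    rho = (S - css / np.arange(1, K + 1) > 0).sum(axis=1)
    tau = css[np.arange(M), rho - 1] / rho
    return np.maximum(Q - tau[:, None], 0.0)

def fit_lda(X, y, U, Q, K):
    """Weighted LDA estimates: priors, class means, shared covariance."""
    Z = np.vstack([X, U])
    W = np.vstack([np.eye(K)[y], Q])               # hard labels enter as one-hot weights
    mass = W.sum(axis=0)
    means = (W.T @ Z) / mass[:, None]
    cov = np.zeros((Z.shape[1], Z.shape[1]))
    for k in range(K):
        D = Z - means[k]
        cov += (D * W[:, [k]]).T @ D
    return mass / len(Z), means, cov / len(Z)

def log_joint(Z, priors, means, cov):
    """log(pi_k g(z | mu_k, Sigma)) for every row z of Z and every class k."""
    d = Z.shape[1]
    _, logdet = np.linalg.slogdet(cov)
    P = np.linalg.inv(cov)
    out = np.empty((len(Z), len(priors)))
    for k in range(len(priors)):
        D = Z - means[k]
        maha = np.einsum("ij,jk,ik->i", D, P, D)
        out[:, k] = np.log(priors[k]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
    return out

def mcpl_lda(X, y, U, K, max_iter=1000, tol=1e-6):
    rows = np.arange(len(X))
    sup = fit_lda(X, y, U[:0], np.zeros((0, K)), K)   # supervised estimate on X only
    sup_X = log_joint(X, *sup)[rows, y].sum()
    sup_U = log_joint(U, *sup)
    Q = np.full((len(U), K), 1.0 / K)                 # uniform start (assumption)
    prev = np.inf
    for t in range(1, max_iter + 1):
        theta = fit_lda(X, y, U, Q, K)                # step 1: maximize over theta
        grad = log_joint(U, *theta) - sup_U           # step 2: gradient with respect to q
        obj = log_joint(X, *theta)[rows, y].sum() - sup_X + (Q * grad).sum()
        if abs(obj - prev) <= tol:                    # halt when the objective stalls
            break
        prev = obj
        Q = project_rows_to_simplex(Q - grad / t)     # descent step with alpha = 1/t
    return fit_lda(X, y, U, Q, K), Q, sup
```

Calling `mcpl_lda(X, y, U, K)` returns the final parameter triple, the soft labels, and the supervised triple, so that the two sets of estimates can be compared on held-out data.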

Finally, we remark that care should be taken when calculating the necessary log- 
likelihoods or any of the related quantities. For example, the logarithm of the determinant 
of the average class covariance matrices can, especially for moderate- and high-dimensional 
problems, easily result in numerical infinities. Fairly reliable results can, in this instance,
be obtained by determining the singular values of the covariance matrix through an SVD 
and taking the sum of the logarithm of these values. 
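As a small illustration of the numerical remark above, the following sketch (assuming NumPy; the helper name `logdet_via_svd` and the test matrix are hypothetical) computes the log-determinant as the sum of the logarithms of the singular values and contrasts it with the plain determinant, which underflows for a moderately sized matrix.

```python
import numpy as np

def logdet_via_svd(cov):
    """Log-determinant of a symmetric positive definite matrix from its singular values."""
    s = np.linalg.svd(cov, compute_uv=False)
    return float(np.sum(np.log(s)))

d = 400
cov = 1e-3 * np.eye(d)                  # hypothetical, well-conditioned covariance
print(np.linalg.det(cov))               # the determinant itself underflows in double precision
print(logdet_via_svd(cov))              # finite value, equal to d * log(1e-3)
```

NumPy's `numpy.linalg.slogdet` offers a comparable safeguard by returning the sign and the logarithm of the determinant separately.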

5 Experiments and Results with LDA 

Having presented the specific theory for semi-supervised LDA and a heuristic approach to 
find its MCPL parameters in Section 4, there are four main issues we want to investigate
experimentally. To start with, the theory states that semi-supervised LDA estimates are 
better on the training data at hand given the log-likelihood as the performance measure. 
The two questions this raises are, firstly, how do these estimates compare to the supervised 
estimates on new and previously unseen test data? And secondly, how do they perform 
and compare in terms of the 0-1 loss, i.e., the classification error? Concerning the second 
point, we remark that the relation between likelihood and error rate is not necessarily 
monotonic and a higher likelihood does not necessarily lead to a lower error. It is only 
in recent years that considerable effort has been spent on understanding the nontrivial 
relationship between the criterion a classifier optimizes (here the likelihood) and how that 
classifier performs in terms of any other criterion of interest (here the error rate). Refer, 
for instance, to [75H7S]. Thirdly, we measure the log-likelihood for the various parameter
estimates also on the training set. This gives us a basic check on the performance of our
optimization heuristic: we should find that the semi-supervised solution never deteriorates
the supervised solution and typically even improves upon it. The final, fourth point is 
to compare our theoretically underpinned method to the semi-supervised LDA technique 
from [31], which enforced the constraints in Equations 03]) and (]3]) in an ad hoc way. It
puts our novel method in a broader perspective, as the earlier method has been studied 
extensively already. Among others, this constrained LDA has been shown to perform 
much better than self-learning or EM approaches to LDA and to be competitive with 
transductive SVM m and even entropy regularized logistic regression [80], especially in
the small sample setting. 

5.1 Data Sets and Preprocessing 

We chose 16 data sets from the UCI Machine Learning Repository [81] to perform our
experiments on. The full names can be found in Table 1. The same table contains
abbreviated names that we use to refer to these sets in other tables and throughout the
text.

Table 1: Full names and abbreviations of the 16 data sets from [81]. Requested references
are also included.

full data set name                             abbreviated    cit.
banknote authentication                        banknote
climate model simulation crashes               climate        m
first-order theorem proving                    first-order    [83]
gas sensor array drift                         gas            m
landsat satellite                              landsat
letter recognition                             letter
low resolution spectrometer                    low
magic gamma telescope                          magic
miniboone particle identification              miniboone
optical recognition of handwritten digits      optical
pen-based recognition of handwritten digits    pen-based
qsar biodegradation                            qsar
shuttle                                        shuttle
skin segmentation                              skin           m
spambase                                       spambase
spectf heart                                   spectf

A main criterion for choosing these particular data sets was their size. We wanted to be
able to easily generate labeled and unlabeled training sets from them plus independent test
sets, and we wanted especially the last two sets to have a fair size. In addition, we wanted
to limit the computational burden and therefore did not choose too high-dimensional sets.
Moreover, in order to rid ourselves of potential problems with singular class-conditional
covariance matrices (which would leave the supervised estimation problem ill-posed) or
numerical challenges related to this, the complete data sets were preprocessed in the
following way. In a first step, the variance of every individual feature was normalized to one.
A feature was removed altogether if its variance was numerically zero. In a second step,
PCA was applied to the full sets and 99.9% of the variance was retained in order to remove
linearly dependent features. We note that reducing the dimensionality essentially changes
the likelihood of a data set, but that any nonsingular linear transformation merely offsets
the log-likelihood attained by LDA.
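The two preprocessing steps can be sketched as below. This is only an illustration under assumptions: the function name `preprocess`, the threshold `eps` for "numerically zero", and the mean-centering before PCA are choices of this sketch, and the retained fraction is passed as a parameter.

```python
import numpy as np

def preprocess(Z, retain=0.999, eps=1e-12):
    """Scale features to unit variance, drop constant features, then apply PCA."""
    var = Z.var(axis=0)
    keep = var > eps                              # remove features with (numerically) zero variance
    Z = Z[:, keep] / np.sqrt(var[keep])           # normalize the remaining variances to one
    Zc = Z - Z.mean(axis=0)
    _, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative fraction of retained variance
    n_comp = min(int(np.searchsorted(frac, retain)) + 1, len(s))
    return Zc @ Vt[:n_comp].T

# Hypothetical data: two informative features, one linear combination, one constant.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 200))
Z = np.column_stack([a, 5 * b, a + b, np.ones(200)])
print(preprocess(Z).shape)
```

In this toy input the constant column is removed in the first step and the linearly dependent column is absorbed by PCA, so two dimensions remain.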

Table 2 provides various statistics for the 16 data sets. It also indicates, in the last
column, which 6 of the 16 data sets consist purely of discrete feature values. The fourth-to-
last to second-to-last columns in the table give the different sizes of labeled (N), unlabeled
(M), and test sets we used in every run of our experiments. We do not expect much gain
from employing unlabeled data if the number of labeled points is large. We therefore kept
the labeled set small, choosing a size of twice the dimensionality plus once the number of
classes: 2d + K. We also took care that every class has at least one labeled instance in the
training set. The remainder of the data was then randomly divided into two sets of more or
less equal size that make up the unlabeled and test sets, respectively.
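One way to realize such a split is sketched below, assuming NumPy. How the presence of every class in the labeled set was actually enforced is not stated in the text; drawing one guaranteed instance per class first is an assumption of this sketch, as is the function name `split_indices`.

```python
import numpy as np

def split_indices(y, d, K, rng):
    """Draw N = 2d + K labeled indices with every class present; halve the remainder."""
    n_labeled = 2 * d + K
    # one guaranteed labeled instance per class (assumed mechanism)
    first = np.array([rng.choice(np.flatnonzero(y == k)) for k in range(K)])
    rest = np.setdiff1d(np.arange(len(y)), first)
    extra = rng.choice(rest, size=n_labeled - K, replace=False)
    labeled = np.concatenate([first, extra])
    remaining = rng.permutation(np.setdiff1d(np.arange(len(y)), labeled))
    half = (len(remaining) + 1) // 2              # unlabeled set gets the odd object, if any
    return labeled, remaining[:half], remaining[half:]

# Class sizes and dimensionality of spectf as listed in Table 2.
rng = np.random.default_rng(0)
y = np.array([0] * 212 + [1] * 55)
labeled, unlabeled, test = split_indices(y, d=43, K=2, rng=rng)
print(len(labeled), len(unlabeled), len(test))    # 88 90 89
```

With the spectf sizes from Table 2 (267 objects, d = 43, K = 2), the sketch yields the same N, M, and test set sizes as reported there.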

5.2 Performance Criteria and Results 

With the labeled, unlabeled, and test sets as described above, we determined $\hat{\theta}_{\mathrm{sup}}$, $\hat{\theta}_{\mathrm{semi}}$,
and $\hat{\theta}_{\mathrm{opt}}$. In addition, we calculated $\hat{\theta}_{\mathrm{hoc}}$, which are the parameters of the constrained LDA
estimated by means of the more ad hoc procedure in [31]. For $\hat{\theta}_{\mathrm{opt}}$, we of course had to use
the true labels belonging to the unlabeled data. The parameters in $\hat{\theta}_{\mathrm{hoc}}$ can be estimated
in closed form. For details, we refer to the original work in [31].

For every data set the experiments were repeated 1000 times. Using the estimates
$\hat{\theta}_{\mathrm{sup}}$, $\hat{\theta}_{\mathrm{semi}}$, and $\hat{\theta}_{\mathrm{opt}}$, we calculated the following twelve criteria based on the log-likelihood
for Table 3: the three average log-likelihoods (denoted $L_{\mathrm{sup}}$, $L_{\mathrm{semi}}$, and $L_{\mathrm{opt}}$) on the
independent test data; the same three average log-likelihoods on the labeled plus unlabeled
data, i.e., the training data $X_{V^*}$; the percentage of times that the log-likelihood of the
semi-supervised learner is strictly larger than the log-likelihood of the supervised learner
(denoted semi/sup, read: semi-supervised over supervised); the percentage that the log-likelihood
of the optimal classifier is strictly larger than the semi-supervised one (this number, denoted
opt/semi, as well as the previously defined semi/sup are calculated both on the test and the
training set); and finally we expressed the relative improvement of the semi-supervised approach
over the supervised approach in comparison with the optimal estimates by
$(L_{\mathrm{semi}} - L_{\mathrm{sup}}) / (L_{\mathrm{opt}} - L_{\mathrm{sup}})$. Again
this is done both on the test and the training set. The same quantities are also calculated
for the corresponding error rates $\varepsilon_{\mathrm{sup}}$, $\varepsilon_{\mathrm{semi}}$, and $\varepsilon_{\mathrm{opt}}$ (see Table 4), with the only difference
that we check numbers to be strictly smaller, instead of larger, to determine semi/sup and opt/semi.
Finally, Table 5 contains averaged log-likelihoods $L_{\mathrm{hoc}}$ and error rates $\varepsilon_{\mathrm{hoc}}$, both on training
and test sets, for the more ad hoc semi-supervised approach. Similar to those in Tables
3 and 4, in the last four columns, comparisons to the corresponding log-likelihoods and
classification errors of the supervised and our novel semi-supervised approach are made.
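The win percentages and the relative improvement can be computed as in the following sketch. The function names are hypothetical, and the relative improvement is evaluated here on the rounded banknote averages from Tables 3 and 4; whether the reported values were obtained from averaged quantities or averaged per run is not stated, so small deviations from the tabulated numbers are to be expected.

```python
def relative_improvement(sup, semi, opt):
    """(semi - sup) / (opt - sup): 0 means no gain, 1 means the optimal value is matched."""
    return (semi - sup) / (opt - sup)

def percent_strict_wins(a, b, larger_is_better=True):
    """Percentage of paired runs in which a is strictly better than b."""
    wins = sum((x > y) if larger_is_better else (x < y) for x, y in zip(a, b))
    return 100.0 * wins / len(a)

# banknote, test set: averaged log-likelihoods and error rates as listed in the tables
print(round(relative_improvement(-11.7, -4.72, -4.51), 3))   # 0.971
print(round(relative_improvement(0.061, 0.052, 0.025), 3))
```

The same ratio serves for log-likelihoods and error rates, since the sign changes in numerator and denominator cancel; only the win percentages need the direction flag.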

A permutation test on all different paired results, both for the four log-likelihoods
$L_{\mathrm{sup}}$, $L_{\mathrm{semi}}$, $L_{\mathrm{opt}}$, and $L_{\mathrm{hoc}}$ and the four errors $\varepsilon_{\mathrm{sup}}$, $\varepsilon_{\mathrm{semi}}$, $\varepsilon_{\mathrm{opt}}$, and $\varepsilon_{\mathrm{hoc}}$, showed that for
almost all cases we cannot retain the hypothesis that their averages are the same (at
$p \ll 0.001$). There are a few exceptions though. For the test error rates $\varepsilon_{\mathrm{sup}}$ and $\varepsilon_{\mathrm{semi}}$ on
spectf, we cannot reject the null hypothesis of equality of expectation (at p = 0.68). On
optical and qsar there is no statistically significant difference between $L_{\mathrm{semi}}$ and $L_{\mathrm{opt}}$ for
the test log-likelihoods (at p = 0.01 and 0.50, respectively). Finally, $L_{\mathrm{sup}}$ and $L_{\mathrm{hoc}}$ are,
both in training and testing, not significantly different on shuttle (at p = 0.25 and 0.25)
and spambase (at p = 0.76 and 0.99), while $\varepsilon_{\mathrm{sup}}$ and $\varepsilon_{\mathrm{hoc}}$ are not significantly different
on skin (at p = 0.03 and 0.03). For easy reference, the related performance numbers are
underlined in the respective result tables.
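The exact form of the permutation test is not specified in the text. As one plausible variant, the sketch below implements a two-sided paired test that randomly flips the sign of each per-run difference; the function name, the number of permutations, and the add-one p-value estimate are assumptions of this sketch.

```python
import random

def paired_permutation_pvalue(a, b, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation test for equal means of paired samples."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        # under the null hypothesis, each paired difference is equally likely to change sign
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Applied to the 1000 paired results per data set, for instance the per-run test log-likelihoods of two methods, a small returned value indicates that the hypothesis of equal averages cannot be retained.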


6 Discussion 

6.1 Guarantees on the Training Set 

The results in Table 3 show that, on the training set, MCPL-based semi-supervised LDA is
in between the regular supervised and the optimal estimate. That this happens to be the
case in a strict sense, in all experiments we carried out, can be most readily deduced from
the values under semi/sup and opt/semi on the training set. These numbers equal 100.0 in all cases.
This, in turn, indicates that in all of the 16,000 experiments we ran, the strict inequality
from Theorem 1 was satisfied. Even for the discrete data sets this holds true, which was to
be expected, given the number of different discrete vectors these data sets take on. Spectf
has the smallest number, 267, implying that every feature vector in spectf is unique. With
267 distinct values, chances are indeed very small that the means from Equations (15) and
(16) coincide.

6.2 Likelihood Behavior on the Test Set 

The aforementioned guarantees are on the training set that includes the unlabeled samples 
in U, but of course we are interested in the performance on independent test data as well. 
We are unaware of any theoretical results for the log-likelihood that provide a precise 
connection between performance on the training set and the test set, though we do expect 
that with more training data the likelihood of the supervised model on the test set becomes 
better in expectation. We need to consider such improvement in expectation, simply 
because, for a single instantiation of a classification problem, we might be unlucky in our 
draw of training or test set. In contrast with the situation in the training phase, we can 
therefore only get improvements on average. Comparing the test log-likelihood in Table
3 for the supervised method with the one for the semi-supervised approach, we see the same
as on the training data: for every data set, $L_{\mathrm{sup}}$ is smaller than $L_{\mathrm{semi}}$. Also, if we look at
semi/sup, we see that there are only two cases out of 16,000 in which the supervised estimate
was better: we find a percentage of 99.8 instead of 100.0 on miniboone.

The story is different, however, if we compare the semi-supervised and the optimal
estimates. First of all, opt/semi indicates that, on the independent test set, the semi-supervised
estimate is better than the optimal one in about 5% of the cases. In itself, this does not have
to be at odds with what we expect for the likelihood, as it concerns the number of wins or
losses and not the average log-likelihood. Our results on gas, optical, and qsar, however,
indicate that also when it comes to the expected log-likelihood, $\hat{\theta}_{\mathrm{semi}}$ may outperform $\hat{\theta}_{\mathrm{opt}}$.
Only the result on gas is statistically significant though. Moreover, the differences are
relatively small anyway, as the second-to-last column in Table 3 also illustrates, where we
find values basically equal to 1 for these sets.

Regarding the log-likelihood, we generally note the following. Overall, the relative
improvements, as provided in the last two columns of Table 3, are considerable, sometimes
even enormous. None of them is lower than 0.9 and many are virtually 1. This shows that
the semi-supervised log-likelihood is, relative to the supervised value, very close to the
optimal estimate. The immense improvements are probably explained by the fact that the
averaged class-conditional covariance matrix $\Sigma$ is much more stably estimated in case of
semi-supervision. The supervised estimate relies on N = 2d + K samples, while the
semi-supervised estimate, as can be readily seen from Equation (14), is based on all N + M
samples in the training set. In our experiments N + M is considerably larger than N. The latter
is only slightly larger than twice the dimensionality, resulting in unstable covariance estimates.
Clearly, the extreme difference in behavior for the various estimates will disappear with
increasing numbers of labeled data.

6.3 Error Rates 

Unlike the log-likelihood, the 0-1 loss is bounded and the differences and relative
improvements stated in Table 4 are not that large. In almost all cases, $\varepsilon_{\mathrm{semi}}$ is smaller than $\varepsilon_{\mathrm{sup}}$
and $\varepsilon_{\mathrm{opt}}$ is smaller than $\varepsilon_{\mathrm{semi}}$ in turn. On the test set, the maximum relative improvement
reported is 0.426 on optical, with a good second of 0.415 on shuttle.

There are three settings, however, in which no improvements of semi-supervised over
supervised learning are attained: the first one is on the training set for low and the two
others are in the training and test phase for spectf. In all cases, $L_{\mathrm{semi}}$ is better than
$L_{\mathrm{sup}}$. So we have the, possibly, somewhat counterintuitive behavior that the estimates
improve in terms of the expected log-likelihood, but that the expected error rate still
deteriorates. Similar phenomena for other classifiers have been described in [741175], where
simple artificial examples are provided of how such behavior can be realized. It is a glimpse
of the difficult interrelationship that two different performance criteria can
display [ T3U76H75], which we alluded to earlier on in Section 5. We checked the learning
curves for low and spectf and they just showed the regular behavior: with increasing
labeled sample sizes, the expected error rate of the supervised classifier decreases.

Finally, we remark that the increase in error rate going from the training to the test 
set is less for the semi-supervised classifier than for the supervised one. This shows that 
the semi-supervised classifier is less overtrained on the training set than supervised LDA. 

6.4 Comparison to Constrained LDA 

Looking at Table 5, we see that the ad hoc approach can also work well. Especially when
looking at the likelihood and comparing it to the supervised estimates, we see that, both
on the training and the test set, the estimated likelihood is often better than the one
obtained by the regular supervised parameters. The reason for the constrained approach
to often be so much better than the supervised approach is probably similar to the one
given in Subsection 6.2 to explain why the new approach comes so close to the optimal
log-likelihoods. The large improvements are probably due to the fact that the averaged
class-conditional covariance matrix $\Sigma$ is much more stably estimated in case of semi-supervision.
The estimated covariance matrix might still not be very good, but at least it is substantially
better than the volatile and not so well conditioned supervised estimate. Nonetheless, the
novel approach clearly outperforms the more ad hoc technique in most of the cases where
the likelihood is concerned. In fact, compared to the constrained approach, MCPL provides
the best average test log-likelihood on all data sets. The only expected log-likelihood that
is worse during training is the one for spectf.

Looking at the error rate, we see that the ad hoc procedure does very badly on optical
and shuttle (the reason for this remains as yet unclear). Still, $\hat{\theta}_{\mathrm{hoc}}$ leads to the best error
rate on the test set on seven data sets. On the other nine data sets $\hat{\theta}_{\mathrm{semi}}$ turns out to be
preferred.

6.5 MCPL for Other Classifiers 

MCPL is proposed as a general estimation principle, which delivers semi-supervised
estimates that are at least as good as the regular supervised parameter estimates for any
log-likelihood based classifier. To come to results such as Theorems 1 and 2, additional
knowledge about the class-conditional distributions is needed. Because they are very
similar to LDA and the same kind of mean constraints hold, classifiers for which it is almost
immediate that strict or expected improvements can be obtained through semi-supervision
are the NMC (nearest mean classifier), quadratic discriminant analysis (QDA), and all
kinds of kernelized or flexibilized versions of NMC, LDA, and QDA [88]. We speculate
that also many classifiers constructed on the basis of exponential families mm allow
for theorems making equivalent statements. These include, for instance, the Bernoulli,
multinomial, and exponential density.

Another interesting group of classifiers to study in the context of MCPL is that for
which every class may consist of a mixture model. As the analysis of mixture models is
in itself already rather difficult [89] (for one, the likelihood function is not concave), such
classifiers may be outside the reach of any helpful theoretical analysis. We do, however,
expect to benefit, if only from the regularizing effect our semi-supervised approach has,
similar to the situation mentioned at the end of Subsection 6.2. What does still seem a problem
is to find an appropriate solution to the optimization that needs to be carried out in
order to find an MCPL estimate. It seems worthwhile, though, to try to get to the nearest
saddle point that can be found by means of a combined gradient ascent (in $\theta$) and descent
(in q).

Finally, we could try to extend our work to classifiers that do not rely on likelihood
models. One possible path may be through [90], which presents a decision-theoretic
interpretation of maximum entropy and considers generalized concepts of entropy that relate
to a much broader class of loss functions than merely the (negative) log-likelihood. Though
the link with this work is certainly not one-to-one, it may be possible to interpret our
contrastive loss as a form of relative entropy and to make use of the results in [90].
7 Conclusion


We presented a well-founded approach to likelihood-based semi-supervised learning. Our
principle of maximum contrastive pessimistic likelihood (MCPL) estimation is generally
applicable to supervised classifiers whose parameters are estimated by means of a
maximization of the likelihood. Moreover, under certain concavity assumptions, improvements
of the semi-supervised estimates can be expected and, in particular cases, even be
guaranteed. A worked-out illustration based on classical LDA demonstrates the significant
improvements that can be obtained by our novel approach.
Table 2: Basic data set properties: number of objects, dimensionality of the original feature vectors, dimensionality after
PCA (d), number of classes K, sizes of the largest and the smallest class, number of labeled (N), unlabeled (M), and test
objects in every run of our experiments, and whether features are purely discrete.

data set (abbr.)  #objects  dim.  PCA/d  K   largest (%)      smallest (%)     N    M       #test   discr.
banknote          1372      4     4      2   762 (55.5)       610 (44.5)       10   681     681     no
climate           540       18    18     2   494 (91.5)       46 (8.5)         38   251     251     no
first-order       6118      51    41     6   2554 (41.7)      486 (7.9)        88   3015    3015    no
gas               13910     128   60     6   3009 (21.6)      1641 (11.8)      126  6892    6892    no
landsat           6435      36    33     6   1533 (23.8)      626 (9.7)        72   3182    3181    yes
letter            20000     16    16     26  813 (4.1)        734 (3.7)        58   9971    9971    yes
low               531       93    70     10  90 (16.9)        4 (0.8)          150  191     190     no
magic             19020     10    10     2   12332 (64.8)     6688 (35.2)      22   9499    9499    no
miniboone         130064    50    11     2   93565 (71.9)     36499 (28.1)     24   65020   65020   no
optical           5620      64    61     10  572 (10.2)       554 (9.9)        132  2744    2744    yes
pen-based         10992     16    16     10  1144 (10.4)      1055 (9.6)       42   5475    5475    yes
qsar              1055      41    38     2   699 (66.3)       356 (33.7)       78   489     488     no
shuttle           58000     9     6      7   45586 (78.6)     10 (0.0)         19   28991   28990   yes
skin              245057    3     3      2   194198 (79.2)    50859 (20.8)     8    122525  122524  no
spambase          4601      57    56     2   2788 (60.6)      1813 (39.4)      114  2244    2243    no
spectf            267       44    43     2   212 (79.4)       55 (20.6)        88   90      89      yes

Table 3: Results calculated based on the log-likelihoods from the 1000 experiments per data set for the supervised and
our semi-supervised approach. Refer to Subsection 5.2 for a description of the various criteria determined.

                  estimated on test                     estimated on full train               % test wins         % trn. wins         (L_semi - L_sup)/(L_opt - L_sup)
data set (abbr.)  L_sup       L_semi      L_opt         L_sup       L_semi      L_opt         semi/sup  opt/semi  semi/sup  opt/semi  test     trn.
banknote          -11.7       -4.72       -4.51         -11.5       -4.69       -4.48         100.0     98.4      100.0     100.0     0.971    0.970
climate           -34.1       -26.5       -26.2         -32.6       -25.8       -25.5         100.0     100.0     100.0     100.0     0.964    0.961
first-order       -1.88e+03   -62.6       -60.3         -1.78e+03   -40.4       -39.2         100.0     100.0     100.0     100.0     0.999    0.999
gas               -4.46e+04   -4.4e+03    -4.41e+03     -4.37e+04   -13.1       -12.4         100.0     44.8      100.0     100.0     1.000    1.000
landsat           -33.2       -4.64       -3.73         -32.4       -4.35       -3.42         100.0     100.0     100.0     100.0     0.969    0.968
letter            -63.6       -22.3       -18.4         -63.3       -22.2       -18.3         100.0     100.0     100.0     100.0     0.914    0.913
low               -90.1       -19.8       -17.6         -37.8       11.7        13.9          100.0     99.9      100.0     100.0     0.969    0.957
magic             -30.6       -11.7       -11.1         -30.6       -11.6       -11.1         100.0     100.0     100.0     100.0     0.974    0.974
miniboone         -2.2e+09    -7.17e+07   -6.93e+07     -2.42e+09   -9.75       -9.48         99.8      93.1      100.0     100.0     0.999    1.000
optical           -6.24e+15   -5.66e+12   -6.35e+12     -6.06e+15   -61.1       -60.1         100.0     83.8      100.0     100.0     1.000    1.000
pen-based         -45.2       -15.9       -13.5         -44.9       -15.8       -13.5         100.0     100.0     100.0     100.0     0.927    0.926
qsar              -4.02e+14   -1.02e+03   -1.03e+03     -3.36e+14   -37.2       -36.9         100.0     99.7      100.0     100.0     1.000    1.000
shuttle           -5.42e+07   -9.81       -9.24         -6.8e+07    -9.37       -8.76         100.0     96.9      100.0     100.0     1.000    1.000
skin              -125        -3.84       -3.45         -125        -3.84       -3.45         100.0     100.0     100.0     100.0     0.997    0.997
spambase          -1.09e+16   -81.6       -81.3         -9.76e+15   -73.7       -73.4         100.0     100.0     100.0     100.0     1.000    1.000
spectf            -78.6       -53.6       -53.1         -54.5       -36.8       -36.5         100.0     97.5      100.0     100.0     0.982    0.985

Table 4: Results based on the error rates obtained from the 1000 experiments per data set for the supervised and our
semi-supervised approach. Subsection 5.2 gives a description of the various criteria.

                  estimated on test          estimated on full trn.     % test wins         % trn. wins         (e_semi - e_sup)/(e_opt - e_sup)
data set (abbr.)  e_sup   e_semi  e_opt      e_sup   e_semi  e_opt      semi/sup  opt/semi  semi/sup  opt/semi  test     trn.
banknote          0.061   0.052   0.025      0.061   0.052   0.024      69.7      89.7      70.5      89.3      0.254    0.240
climate           0.150   0.143   0.053      0.133   0.129   0.034      63.9      99.8      56.0      100.0     0.071    0.033
first-order       0.666   0.658   0.529      0.652   0.650   0.514      75.9      100.0     55.3      100.0     0.055    0.015
gas               0.141   0.134   0.085      0.139   0.133   0.082      68.5      99.9      65.7      99.8      0.134    0.105
landsat           0.291   0.251   0.161      0.285   0.247   0.153      100.0     100.0     99.9      100.0     0.312    0.286
letter            0.618   0.599   0.299      0.615   0.595   0.294      97.5      100.0     97.1      100.0     0.061    0.060
low               0.763   0.747   0.696      0.475   0.501   0.334      70.0      91.5      2.2       100.0     0.233    -0.181
magic             0.317   0.303   0.216      0.316   0.303   0.216      90.3      100.0     89.4      99.8      0.136    0.134
miniboone         0.246   0.229   0.159      0.246   0.229   0.159      83.6      99.9      83.7      99.9      0.198    0.197
optical           0.161   0.113   0.049      0.154   0.111   0.042      100.0     100.0     100.0     100.0     0.426    0.385
pen-based         0.280   0.243   0.124      0.278   0.241   0.122      99.6      100.0     100.0     100.0     0.238    0.234
qsar              0.257   0.247   0.154      0.229   0.226   0.132      65.7      100.0     53.1      100.0     0.089    0.031
shuttle           0.134   0.103   0.059      0.134   0.103   0.059      82.1      83.7      81.7      83.7      0.415    0.413
skin              0.098   0.087   0.068      0.098   0.087   0.068      79.8      55.9      79.8      56.0      0.365    0.365
spambase          0.195   0.185   0.112      0.189   0.182   0.108      76.2      99.8      70.7      100.0     0.117    0.086
spectf            0.325   0.325   0.260      0.203   0.210   0.131      41.7      85.7      21.6      100.0     -0.006   -0.108

Table 5: Log-likelihood and error rate results obtained from the 1000 experiments per data set for the ad hoc
semi-supervised approach and its comparison to our novel semi-supervised and regular supervised approach. Refer to
Subsection 5.2 for an explanation of the various criteria.

                  test        trn.        test    trn.     win test lik.       win trn. lik.       win test err.       win trn. err.
data set (abbr.)  L_hoc       L_hoc       e_hoc   e_hoc    hoc/sup   semi/hoc  hoc/sup   semi/hoc  hoc/sup   semi/hoc  hoc/sup   semi/hoc
banknote          -9.38       -9.29       0.087   0.086    73.8      96.5      74.0      96.6      30.1      76.2      30.6      75.2
climate           -27         -26.2       0.117   0.102    100.0     93.7      100.0     93.3      79.9      22.4      81.1      17.5
first-order       -68         -43.7       0.626   0.616    100.0     100.0     100.0     100.0     96.8      7.6       95.0      5.8
gas               -5.66e+03   -21.1       0.145   0.143    100.0     99.9      100.0     100.0     44.7      68.3      42.9      67.9
landsat           -16.8       -16.2       0.308   0.302    99.4      100.0     99.5      100.0     29.8      98.6      27.9      98.0
letter            -53.1       -52.9       0.625   0.622    99.8      100.0     99.7      100.0     33.2      92.4      32.2      92.9
low               -27.9       9.42        0.744   0.485    100.0     100.0     100.0     100.0     74.9      39.3      26.1      16.4
magic             -12.4       -12.4       0.292   0.292    100.0     80.7      100.0     80.7      74.0      37.8      74.3      38.9
miniboone         -7.65e+07   -10.8       0.218   0.218    99.7      96.1      100.0     98.3      73.1      41.3      72.6      40.7
optical           -7.74e+15   -7.48e+15   0.900   0.900    29.5      99.0      32.7      100.0     0.0       100.0     0.0       100.0
pen-based         -35.4       -35         0.299   0.297    98.9      100.0     99.1      100.0     24.5      98.7      24.8      98.5
qsar              -1.51e+13   -1.1e+13    0.229   0.209    100.0     93.2      100.0     96.6      86.9      16.1      83.8      14.9
shuttle           -5.51e+05   -5.82e+05   0.822   0.822    1.6       100.0     1.6       100.0     1.6       99.1      1.6       99.1
skin              -40.4       -40.4       0.102   0.102    94.7      95.2      94.7      95.4      40.1      71.2      40.6      71.1
spambase          -1.66e+16   -8.65e+15   0.310   0.307    85.1      100.0     85.1      100.0     51.3      51.0      51.8      48.4
spectf            -53.8       -36.8       0.293   0.182    100.0     74.2      100.0     42.3      71.0      17.8      78.4      8.0